Virginia Tech – Professional Analysts using a
Large, High-Resolution Display
VAST 2009 Challenge
Challenge 1: Badge and Network Traffic
Authors and Affiliations:
Alex Endert, Virginia
Tech, aendert@cs.vt.edu [PRIMARY contact]
Christopher Andrews, Virginia Tech, cpa@cs.vt.edu
Chris North, Virginia Tech, north@cs.vt.edu
[Faculty advisor]
Tool(s): (Short Answer)
Our
team’s goal was to observe how professional analysts solve this challenge using
a visualization on a large, high-resolution display. As we worked in
conjunction with the team developing the dataset, we were aware of the
solution, so our goal was not to solve the task, but to observe the process
taken by professional analysts attempting to solve this challenge. We are not submitting to this contest in the traditional manner, but
rather taking a different approach to this challenge by highlighting the
processes professional cyber analysts take to solve a challenge such as this.
Figure 1. The large, high-resolution display,
arranged in a curved setup, and totaling nearly 33 megapixels.
Rather
than equipping the analysts with special purpose cyber-analytic tools, we
provided Microsoft Excel to display and manipulate the raw data and the general-purpose
visualization tool, Spotfire (http://www.spotfire.com). In addition, all
analysis was performed using a large, high-resolution display running Windows
XP. The display consists of eight 30-inch LCD panels, tiled in a 4x2
configuration totaling nearly 33 megapixels (Figure 1).
The analysts were able to display all of the information relevant to the
challenge without minimizing any windows due to space constraints. In addition,
they also had the ability to view and interact with the visualization when
enlarged to span all eight screens (Figure 2).
This meant that they could physically navigate to gain an overview of the
dataset, examine details, switch tasks, and rapidly consult multiple views and
tools. The ability to solve this challenge visually was largely due to the
large, high-resolution display, which allowed for multiple views of the data as
well as the persistence of the entire dataset (e.g. the map, prox data, network
data, and employee IP list). When the visualization was enlarged, it also
provided enough detail that little data was hidden by the aggregation
techniques required on smaller displays.
Figure 2. A screenshot of the visualization showing
the combined prox and IP data maximized to all displays. An enlargement shows
the visualization to scale. Color was used to encode the state of each person
(red: in the building, blue: in classified, yellow: proxed out of classified).
Video Link: vastvid.mov
ANSWERS:
MC1.1: Identify which
computer(s) the employee most likely used to send information to his contact in
a tab-delimited table which contains for each computer identified: when the
information was sent, how much information was sent and where that information
was sent.
MC1.2:
Characterize the patterns of behavior of suspicious computer use. (Detailed
Answer)
Our team worked in collaboration with the group creating the VAST 2009
dataset, and our involvement in the creation of the dataset revealed the
solution to us, allowing us to guide the analysts through the task whenever
needed. We chose to perform such a guided study both to accelerate the
investigation so it would fit within a two-hour session and to uncover cyber
analysts' investigation processes. Our participants were four professional
cyber analysts from a large government laboratory.
Each analyst was given a two-hour session to solve this
challenge. We captured their progress with a video recording and automated
screenshots taken every minute. We followed the study with an
interview where we asked a series of questions regarding their experience, as
well as their typical workspaces and tools.
Our guidance of the analysts can be categorized as three
main types. First, we would periodically encourage the analysts to use the
visualization. We observed three of the four analysts heavily favoring "the
raw data" in Excel, and would suggest that they use the visualization in
conjunction with Excel. We would remind the analysts that in order to solve
this challenge, one may need to make use of all of the data provided, as well
as establish relationships between its separate parts. Second, we served as a
quick reference for questions about the challenge. Due to the two-hour time
limit, the analysts would ask us questions like those normally posted on the
VAST Challenge Discussion Blog, mainly pertaining to the nature of the dataset
and the assumptions that can or cannot be made. Third, when we observed an
analyst spending a large amount of time pursuing a "dead end", we would ask
them to stop pursuing that aspect of the data in the interest of time. For
instance, some analysts would search online repositories of "bad IPs" to see
if any matched the IPs of this challenge; we informed them that the IPs in
this study are not "real", so this approach would not work.
Prior to the studies, we took the time to combine the IPLog data and the prox
data in Excel, which took nearly two hours (a more experienced Excel user
could combine the data significantly faster). Because this step was so time
consuming, when the analysts showed interest in combining this data during
their investigation, we gave them our pre-combined data (in both Spotfire and
Excel).
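This combination step can also be sketched programmatically. The following minimal Python sketch uses hypothetical, simplified record layouts (the actual challenge files differ); the idea is simply to annotate each network record with the most recent badge event for that employee's assigned IP.

```python
# Minimal sketch of joining prox (badge) data with the IP log.
# Field names and sample records are hypothetical simplifications
# of the challenge data, not its actual schema.
prox_events = [
    {"ip": "37.170.100.15", "time": "2008-01-31T13:05:00", "event": "prox-in-classified"},
    {"ip": "37.170.100.15", "time": "2008-01-31T13:40:00", "event": "prox-out-classified"},
]
ip_log = [
    {"source_ip": "37.170.100.15", "time": "2008-01-31T13:10:23",
     "dest_ip": "100.59.151.133", "socket": 8080},
]

# Index badge events by the employee's assigned IP.
by_ip = {}
for rec in prox_events:
    by_ip.setdefault(rec["ip"], []).append(rec)

def latest_prox_state(ip, when):
    """Last badge event for this IP at or before `when`, or None.
    ISO-8601 timestamps compare correctly as plain strings."""
    earlier = [r for r in by_ip.get(ip, []) if r["time"] <= when]
    return max(earlier, key=lambda r: r["time"]) if earlier else None

# Annotate every network record with the badge state in effect at that time.
combined = []
for rec in ip_log:
    state = latest_prox_state(rec["source_ip"], rec["time"])
    combined.append({**rec, "prox_state": state["event"] if state else "unknown"})

print(combined[0]["prox_state"])  # prints "prox-in-classified"
```

The same annotated records are what the combined Spotfire visualization plots: one mark per network event, colored by the badge state in effect at that moment.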
During the initial stages of each trial, it became clear
that each analyst had their own personal set of queries and approaches to
solving such a challenge. Each analyst started the study by performing a series
of premeditated searches and questions based on their prior domain knowledge.
These included queries on specific IPs, sorting by largest packets, creating
pivot tables in Excel to highlight unique IP-to-IP connections, and more.
Their background also strongly dictated the tools they used. For instance, one
analyst was very skilled in Excel, so the majority of her work was done
creating different views of the data within Excel. This proved problematic
because the majority of her work and interactions with the data could not be
captured in any easy way. At times, she resorted to saving versions of the
Excel file in order to maintain a "working state" of the data, from which she
would further explore other directions of her investigation.
The other analysts mainly worked back and forth between
the visualization and the data in Excel, with one analyst doing the majority of
his work within Spotfire. We believe this occurred due to his previous
experience with such a tool, as he felt very comfortable manipulating the
visualization. As keeping the data between the two tools synchronized is
difficult, the analysts would often use the visualization as a means for
exploration and discovery, and then use the Excel file as a way to “quantify
and reconfirm” what they saw. One analyst kept a separate “note file”, where
interesting information was pasted from time to time.
However, moving their investigation from a textual,
query-driven analysis within Excel to a visual one in Spotfire did not come
easily to most analysts. Often, when we would point out something to them within
the visualization, they would glance at it, and then move directly back to
Excel and continue their work there. There was a clear distrust of visualizations
by cyber analysts. Our post-study interview further revealed some of their
thoughts on visualizations. Some commented that visualizations “hide the data”
due to aggregation algorithms, others claimed they were unable to “save states
of what [they were] working on”, causing them to be very tentative with their
visual exploration due to fear of corrupting the current state of their
investigation.
Figure 3. A screenshot showing the use of
multiple views of the data through multiple instances of Spotfire and Excel
running simultaneously. The analyst has the employee data, prox data, IP log
data, task description, and a notes file fully visible. Notice that the
majority of the interactions occur within the tools (e.g. creation of pivot
tables within Excel) and do not utilize the additional screen space well.
Based on past research, we hypothesized the added display
space of the large, high-resolution display would be used for: (1) showing
multiple views of the data (e.g. many windows open, showing different aspects
of the same data) shown in Figure
3, or (2) showing more detail of a single view
(e.g. one maximized detailed visualization) shown in Figure
4. Both of these uses were observed at times
throughout the study. However, we believe their use of the space could have
been much better, for three main reasons. First, based on their responses in
the interviews, none of the analysts were familiar with using a display such
as this. We have found that in order to get comfortable with such a display,
one needs to use it for a week or more, not merely two hours. Second, there
was a learning curve associated with the visualization tool, causing three of
the four analysts to shy away from performing their investigation visually.
Again, we would guide them and show them the basics of the tool to get them
started, but they never became comfortable enough with it to spread their
workspace across multiple instances of Spotfire and achieve "multiple views of
the data". We saw only one analyst do so, and he relied heavily on our help in
setting up the instances of the tool. Third, the tools (both Spotfire and
Excel) did not allow for proper use of the added display space. The analysts'
interactions (i.e. their "work") were not captured and represented in the form
of either larger, more detailed views (2) or extra views (1).
Figure 4. A screenshot showing an instance
when an analyst maximized the visualization of the combined IP and prox data
across all eight screens. This allowed for a fully-detailed visualization,
without the aggregation which may hide data on smaller displays.
A critical point in
every analyst’s investigation occurred when they made the connection between
the prox and IP data. The “synchronization” of this data, a way to combine the
prox information with the network data from each employee, provided a way to
easily visualize where an employee is located when their assigned IP is
actively sending information. However, as we found out in the interviews,
cyber analysts are unaccustomed to aggregating heterogeneous data sources like
this. Their job often deals solely with the network data rather than a
collection of datasets that tie together. Some remarked that this
method "had never occurred to them". The resulting visualization of the
combined data can be seen in Figure
2. We believe
this realization was brought about by the persistence of the data and the
corresponding views. The analysts were often observed switching between the different
data during their investigation. The actual switching was simple, performed by
turning their head or rotating their chair instead of accessing the task bar or
re-arranging windows. Two of the four analysts were able to make this
connection within the first hour of working with the data. For the other two,
we guided them to consider this connection in the interest of time. For all of
the analysts, we provided them with the pre-combined Excel file and
corresponding visualization, as we did not want them spending time on this aspect
of the challenge.
When seeing the combined data visually, the analysts were immediately able to
recognize new aspects of the data. We received comments that the data became
"easier to work with", as the concept of an employee's regular schedule became
clear, including its context relative to the rest of the employees' schedules.
The structure of time became apparent (5 weeks, each with 5 work days and a
2-day weekend), and other insights followed, including the critical one: what
do the "blue dots" mean? The "blue dots" represent an instance when an
employee's IP is actively sending information over the network while that
employee is prox'ed into the classified area – an occurrence which should
never happen.
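The "blue dot" condition the analysts converged on can be expressed directly as a predicate. The sketch below assumes hypothetical interval and event layouts, not the challenge's actual file formats:

```python
# "Blue dot" condition: a network event from an employee's assigned IP
# while that employee is badged into the classified area.
# All records below are hypothetical simplifications.
classified_intervals = {
    # ip -> list of (enter, leave) timestamps while in the classified area
    "37.170.100.15": [("2008-01-31T13:00:00", "2008-01-31T13:30:00")],
    "37.170.100.20": [("2008-01-31T09:00:00", "2008-01-31T09:15:00")],
}

# (source_ip, time, dest_ip, port)
network_events = [
    ("37.170.100.15", "2008-01-31T13:10:23", "100.59.151.133", 8080),
    ("37.170.100.20", "2008-01-31T10:00:00", "10.1.1.4", 80),
]

def is_blue_dot(ip, when):
    """True if the owner of `ip` was inside the classified area at `when`.
    ISO-8601 timestamps compare correctly as plain strings."""
    return any(enter <= when <= leave
               for enter, leave in classified_intervals.get(ip, []))

# Only the first event matches: traffic sent from inside the classified area.
suspicious = [e for e in network_events if is_blue_dot(e[0], e[1])]
```

Visually filtering the scatterplot to these "blue dot" events is equivalent to keeping only the records where this predicate holds.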
We urged the analysts to continue working within the
visualization, providing assistance on how to manipulate the visualization to
show what they wanted to see. As the analysts continued to pursue the meaning
of the "blue dots", they arrived at their conclusion: "An employee's assigned
IP sending information while they are proxed into the classified area".
Glancing back at the map of the office with a simple turn
of their head, they noticed that there were no computers in the classified
section, and therefore no network traffic should be seen from those employees’
IPs. Upon further visual filtering, they obtained a view of only these
activities (Figure
5), showing them email traffic (port 25), web traffic (port 80), and
other network traffic on port 8080.
Figure 5. A filtered scatterplot showing only network traffic from an employee's assigned IP while they are proxed into the classified area. The data includes email traffic (port 25), web traffic (port 80), and other network traffic (port 8080). The y-axis is the source IP, and the x-axis is time.
The analysts then
became interested in where this information was going. At this point, each
seemed excited that they had been able to so dramatically narrow down the
amount of data being shown. They changed their y-axis to represent the
destination IP (Figure
6).
Figure 6. A filtered scatterplot showing only network traffic from an employee’s assigned IP while they are proxed into the classified area. The y-axis is the destination IP, and the x-axis is time.
Seeing all of the
destination IPs arranged like this, and obtaining the details on demand by
highlighting them, it became clear from here that the information was being
sent to a single destination IP (100.59.151.133) from a collection of employee
IPs (37.170.100.15, 37.170.100.16, 37.170.100.31, 37.170.100.41, 37.170.100.52,
37.170.100.56). Table
1 shows the
traffic each of the analysts found to be sending out information. In addition,
as the analysts became more experienced and comfortable with the visualization,
they were able to begin quantifying their results within the visualization
rather than referring back to the Excel file.
Table 1. Suspicious outgoing network traffic.
USER WARNING   | SourceIP      | AccessTime              | DestIP         | Socket | ReqSize  | RespSize
Synthetic Data | 37.170.100.15 | 2008-01-31T13:10:23.841 | 100.59.151.133 | 8080   | 9064720  | 11238
Synthetic Data | 37.170.100.16 | 2008-01-10T16:01:53.956 | 100.59.151.133 | 8080   | 8543125  | 12312
Synthetic Data | 37.170.100.16 | 2008-01-15T16:14:34.563 | 100.59.151.133 | 8080   | 6773214  | 24661
Synthetic Data | 37.170.100.31 | 2008-01-10T14:27:12.238 | 100.59.151.133 | 8080   | 6543216  | 22315
Synthetic Data | 37.170.100.41 | 2008-01-17T12:12:10.990 | 100.59.151.133 | 8080   | 3679122  | 24423
Synthetic Data | 37.170.100.41 | 2008-01-29T16:08:10.892 | 100.59.151.133 | 8080   | 6752212  | 57865
Synthetic Data | 37.170.100.52 | 2008-01-31T09:41:03.815 | 100.59.151.133 | 8080   | 5579339  | 22147
Synthetic Data | 37.170.100.56 | 2008-01-29T15:41:32.763 | 100.59.151.133 | 8080   | 10024754 | 29565
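The final narrowing step (many employee IPs, one destination) amounts to grouping the suspicious traffic by destination IP. A minimal Python sketch over a subset of the rows in Table 1 (the record layout is a hypothetical simplification):

```python
from collections import defaultdict

# Hypothetical subset of the suspicious records:
# (source_ip, dest_ip, request_size), values taken from Table 1.
suspicious = [
    ("37.170.100.15", "100.59.151.133", 9064720),
    ("37.170.100.16", "100.59.151.133", 8543125),
    ("37.170.100.41", "100.59.151.133", 3679122),
]

# Group by destination: which hosts receive the traffic, from how many
# distinct employee IPs, and how many bytes in total.
by_dest = defaultdict(lambda: {"sources": set(), "bytes": 0})
for src, dst, size in suspicious:
    by_dest[dst]["sources"].add(src)
    by_dest[dst]["bytes"] += size

# A single destination receiving large uploads from many employee IPs
# is the pattern the analysts recognized in Figure 6.
for dst, info in by_dest.items():
    print(dst, len(info["sources"]), info["bytes"])
```

This is the textual equivalent of switching the scatterplot's y-axis to the destination IP: all the marks collapse onto one destination row.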
The four professional cyber security analysts performed
the task well. Although they were all reluctant to use the visualization at
first, with our help they were each able to find the solution. After the study,
they remarked how working within a visualization provided them with
"interesting findings" much more quickly than working within the raw data. By
using the large, high-resolution display to keep all of the data visible at
all times, and to enlarge the visualization when added detail was needed, they
were able to use the added display space to draw connections between the
different types of data, which ultimately led each of them to the solution. We
feel that, with future visualizations properly designed to take advantage of
the added display space, cyber analytics can benefit from visualization's
inherent advantage: showing connections one would otherwise overlook.